Transfer Learning with Synthetic Corpora for Spatial Role Labeling and Reasoning
- URL: http://arxiv.org/abs/2210.16952v1
- Date: Sun, 30 Oct 2022 21:23:34 GMT
- Title: Transfer Learning with Synthetic Corpora for Spatial Role Labeling and Reasoning
- Authors: Roshanak Mirzaee and Parisa Kordjamshidi
- Abstract summary: We provide two new data resources on multiple spatial language processing tasks.
The first dataset is synthesized for transfer learning on spatial question answering (SQA) and spatial role labeling (SpRL).
The second dataset is a real-world SQA dataset with human-generated questions built on an existing corpus with SpRL annotations.
- Score: 15.082041039434365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research shows that synthetic data as a source of supervision helps
pretrained language models (PLMs) transfer to new target tasks/domains.
However, this idea is less explored for spatial language. We provide two new
data resources on multiple spatial language processing tasks. The first dataset
is synthesized for transfer learning on spatial question answering (SQA) and
spatial role labeling (SpRL). Compared to previous SQA datasets, we include a
larger variety of spatial relation types and spatial expressions. Our data
generation process is easily extendable with new spatial expression lexicons.
The second one is a real-world SQA dataset with human-generated questions built
on an existing corpus with SpRL annotations. This dataset can be used to
evaluate spatial language processing models in realistic situations. We show
pretraining with automatically generated data significantly improves the SOTA
results on several SQA and SpRL benchmarks, particularly when the training data
in the target domain is small.
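The abstract describes a generation process driven by spatial expression lexicons but gives no code. The sketch below is a minimal illustration of lexicon-driven synthesis of SQA/SpRL examples under assumed inputs: the lexicon entries, object list, templates, and field names are all hypothetical and are not the authors' actual pipeline. Pretraining a PLM on examples like these and then fine-tuning on the small target corpus is the transfer-learning recipe the abstract evaluates.

```python
import random

# Hypothetical spatial-expression lexicon: relation type -> surface expressions.
# Extending the generator amounts to adding entries here.
SPATIAL_LEXICON = {
    "LEFT":  ["to the left of", "on the left side of"],
    "ABOVE": ["above", "over", "on top of"],
    "NEAR":  ["near", "next to", "close to"],
}

OBJECTS = ["a red ball", "a blue box", "a small cat", "a round table"]


def make_example(rng: random.Random) -> dict:
    """Build one synthetic example: a scene sentence, a yes/no question,
    and an SpRL-style spatial-role annotation."""
    trajector, landmark = rng.sample(OBJECTS, 2)
    relation = rng.choice(list(SPATIAL_LEXICON))
    indicator = rng.choice(SPATIAL_LEXICON[relation])

    story = f"There is {trajector} {indicator} {landmark}."
    question = f"Is {trajector} {indicator} {landmark}?"
    return {
        "story": story,
        "question": question,
        "answer": "Yes",  # true by construction; negatives would swap roles or relations
        "spatial_roles": {
            "trajector": trajector,
            "spatial_indicator": indicator,
            "landmark": landmark,
            "relation_type": relation,
        },
    }


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        ex = make_example(rng)
        print(ex["story"], "->", ex["question"], ex["answer"])
```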
Related papers
- Cross-lingual Back-Parsing: Utterance Synthesis from Meaning Representation for Zero-Resource Semantic Parsing [6.074150063191985]
Cross-Lingual Back-Parsing is a novel data augmentation methodology designed to enhance cross-lingual transfer for semantic parsing.
Our methodology effectively performs cross-lingual data augmentation in challenging zero-resource settings.
arXiv Detail & Related papers (2024-10-01T08:53:38Z)
- Adapting a Foundation Model for Space-based Tasks [16.81793096235458]
In the future of space robotics, we see three core challenges which motivate the use of a foundation model adapted to space-based applications.
In this work, we demonstrate that 1) existing vision-language models are deficient visual reasoners in space-based applications, and 2) fine-tuning a vision-language model on extraterrestrial data significantly improves the quality of responses.
arXiv Detail & Related papers (2024-08-12T05:07:24Z)
- How Can Large Language Models Understand Spatial-Temporal Data? [12.968952073740796]
This paper introduces STG-LLM, an innovative approach empowering Large Language Models for spatial-temporal forecasting.
We tackle the data mismatch by proposing: 1) STG-Tokenizer, a spatial-temporal graph tokenizer that transforms intricate graph data into concise tokens capturing both spatial and temporal relationships; and 2) STG-Adapter, a minimalistic adapter consisting of linear encoding and decoding layers that bridges the gap between tokenized data and LLM comprehension (a minimal adapter sketch appears after this list).
arXiv Detail & Related papers (2024-01-25T14:03:15Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts (a constrained-decoding sketch appears after this list).
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step, explain-then-annotate approach (sketched after this list).
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- QAmeleon: Multilingual QA with Only 5 Examples [71.80611036543633]
We show how to leverage pre-trained language models under a few-shot learning setting.
Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained.
Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines.
arXiv Detail & Related papers (2022-11-15T16:14:39Z)
- CLASP: Few-Shot Cross-Lingual Data Augmentation for Semantic Parsing [9.338266891598973]
CLASP generates synthetic data from AlexaTM 20B to augment the training set for a model 40x smaller (500M parameters).
We evaluate on two datasets in low-resource settings: English PIZZA, containing either 348 or 16 real examples, and mTOP cross-lingual zero-shot, where training data is available only in English.
arXiv Detail & Related papers (2022-10-13T15:01:03Z)
- InPars: Data Augmentation for Information Retrieval using Large Language Models [5.851846467503597]
In this work, we harness the few-shot capabilities of large pretrained language models as synthetic data generators for information retrieval tasks.
We show that models finetuned solely on our unsupervised dataset outperform strong baselines such as BM25.
Retrievers finetuned on both supervised and our synthetic data achieve better zero-shot transfer than models finetuned only on supervised data.
arXiv Detail & Related papers (2022-02-10T16:52:45Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained increasing attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
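For the STG-LLM entry above, the summary describes an adapter built only from linear encoding and decoding layers around an LLM. The PyTorch sketch below illustrates that general idea under stated assumptions: the backbone is a HuggingFace-style model that accepts `inputs_embeds` and exposes `last_hidden_state`, and all class, argument, and dimension names are invented for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn


class LinearGraphAdapter(nn.Module):
    """Illustrative adapter in the spirit of the STG-Adapter summary above:
    one linear layer encodes graph tokens into the LLM embedding space, and
    one linear layer decodes LLM hidden states into forecasting targets."""

    def __init__(self, backbone: nn.Module, feature_dim: int, llm_hidden: int, horizon: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():            # keep the LLM frozen
            p.requires_grad = False
        self.encode = nn.Linear(feature_dim, llm_hidden)  # graph tokens -> LLM space
        self.decode = nn.Linear(llm_hidden, horizon)      # LLM states  -> forecasts

    def forward(self, graph_tokens: torch.Tensor) -> torch.Tensor:
        # graph_tokens: (batch, num_nodes, feature_dim), produced by the tokenizer.
        embeddings = self.encode(graph_tokens)
        # Assumes a HuggingFace-style backbone (e.g. GPT2Model) taking inputs_embeds.
        hidden = self.backbone(inputs_embeds=embeddings).last_hidden_state
        return self.decode(hidden)                        # (batch, num_nodes, horizon)
```

Only the two linear layers are trainable here, which is the sense in which such an adapter is "minimalistic".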
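The PAXQA entry mentions lexically-constrained machine translation with constraints taken from parallel bitexts. The snippet below shows generic lexically-constrained decoding via HuggingFace Transformers' `force_words_ids` constrained beam search; the checkpoint choice and the hard-coded constraint phrases are illustrative assumptions, and PAXQA's own constraint extraction from bitext alignments is not reproduced here.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Any seq2seq MT checkpoint works; this English-to-German model is illustrative.
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

source = "Marie Curie won the Nobel Prize in 1903."
# Phrases that must appear verbatim in the translation; in PAXQA-style pipelines
# these would come from the parallel bitexts, here they are hard-coded.
constraints = ["Marie Curie", "1903"]
force_words_ids = tokenizer(constraints, add_special_tokens=False).input_ids

inputs = tokenizer(source, return_tensors="pt")

# Constrained beam search guarantees the constraint phrases occur in the output.
outputs = model.generate(
    **inputs,
    force_words_ids=force_words_ids,
    num_beams=5,
    max_new_tokens=60,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```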
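The AnnoLLM entry describes a two-step explain-then-annotate recipe: first elicit an explanation for a labeled demonstration, then reuse that explanation when annotating new items. The sketch below captures only that control flow; the `ask_llm` callable and the prompt wording are placeholders, not the paper's prompts.

```python
from typing import Callable


def explain_then_annotate(ask_llm: Callable[[str], str],
                          demo_text: str, demo_label: str,
                          new_text: str) -> str:
    """Two-step annotation in the spirit of explain-then-annotate."""
    # Step 1: ask the LLM to explain why the demonstration carries its gold label.
    explanation = ask_llm(
        f"Text: {demo_text}\nLabel: {demo_label}\n"
        "Explain briefly why this label is correct."
    )
    # Step 2: annotate the new example, conditioning on the generated explanation.
    return ask_llm(
        f"Example: {demo_text}\nLabel: {demo_label}\nReason: {explanation}\n\n"
        f"Now assign one label to the following text.\nText: {new_text}\nLabel:"
    )
```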
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.