SNaRe: Domain-aware Data Generation for Low-Resource Event Detection
- URL: http://arxiv.org/abs/2502.17394v2
- Date: Thu, 05 Jun 2025 15:45:00 GMT
- Title: SNaRe: Domain-aware Data Generation for Low-Resource Event Detection
- Authors: Tanmay Parekh, Yuxuan Dong, Lucas Bandarkar, Artin Kim, I-Hung Hsu, Kai-Wei Chang, Nanyun Peng
- Abstract summary: Event Detection is critical for enabling reasoning in highly specialized domains such as biomedicine, law, and epidemiology. We introduce SNaRe, a domain-aware synthetic data generation framework composed of three components: Scout, Narrator, and Refiner. Scout extracts triggers from unlabeled target domain data and curates a high-quality domain-specific trigger list. Narrator, conditioned on these triggers, generates high-quality domain-aligned sentences, and Refiner identifies additional event mentions.
- Score: 84.82139313614255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Event Detection (ED) -- the task of identifying event mentions from natural language text -- is critical for enabling reasoning in highly specialized domains such as biomedicine, law, and epidemiology. Data generation has proven to be effective in broadening its utility to wider applications without requiring expensive expert annotations. However, when existing generation approaches are applied to specialized domains, they struggle with label noise, where annotations are incorrect, and domain drift, characterized by a distributional mismatch between generated sentences and the target domain. To address these issues, we introduce SNaRe, a domain-aware synthetic data generation framework composed of three components: Scout, Narrator, and Refiner. Scout extracts triggers from unlabeled target domain data and curates a high-quality domain-specific trigger list using corpus-level statistics to mitigate domain drift. Narrator, conditioned on these triggers, generates high-quality domain-aligned sentences, and Refiner identifies additional event mentions, ensuring high annotation quality. Experimentation on three diverse domain ED datasets reveals how SNaRe outperforms the best baseline, achieving average F1 gains of 3-7% in the zero-shot/few-shot settings and 4-20% F1 improvement for multilingual generation. Analyzing the generated trigger hit rate and human evaluation substantiates SNaRe's stronger annotation quality and reduced domain drift.
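The three-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `scout`, `narrator`, and `refiner`, the parameters `min_count` and `top_k`, the simple frequency filter used as the "corpus-level statistic", the `generate` callable standing in for an LLM, and the keyword-matching Refiner are all assumptions made for illustration.

```python
from collections import Counter


def scout(extracted, min_count=2, top_k=3):
    """Curate a per-event-type trigger list from (event_type, trigger) pairs.

    `extracted` is assumed to come from some upstream extractor run on
    unlabeled target-domain text. Frequency filtering is a stand-in for
    whatever corpus-level statistics the real system uses.
    """
    counts = Counter((etype, trig.lower()) for etype, trig in extracted)
    by_type = {}
    # most_common() yields pairs from most to least frequent.
    for (etype, trig), n in counts.most_common():
        if n >= min_count and len(by_type.setdefault(etype, [])) < top_k:
            by_type[etype].append(trig)
    return by_type


def narrator(trigger_list, generate):
    """Produce one sentence per curated trigger.

    `generate(event_type, trigger)` is a hypothetical callable that returns
    a sentence containing the trigger (an LLM call in a real pipeline).
    """
    samples = []
    for etype, triggers in trigger_list.items():
        for trig in triggers:
            samples.append({
                "text": generate(etype, trig),
                "mentions": [(etype, trig)],
            })
    return samples


def refiner(samples, trigger_list):
    """Add event mentions that the conditioning trigger did not cover.

    Plain whitespace keyword matching is used here as a placeholder for a
    model-based annotator.
    """
    for sample in samples:
        words = sample["text"].lower().split()
        for etype, triggers in trigger_list.items():
            for trig in triggers:
                if trig in words and (etype, trig) not in sample["mentions"]:
                    sample["mentions"].append((etype, trig))
    return samples
```

With a toy `generate` that returns a sentence mentioning two events, `refiner` attaches the second mention that `narrator` alone would have left unlabeled, which is the kind of missing annotation the Refiner stage is described as recovering.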
Related papers
- TechniqueRAG: Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text [11.417612899344697]
Accurately identifying adversarial techniques in security texts is critical for effective cyber defense. Existing methods face a fundamental trade-off: they either rely on generic models with limited domain precision or require resource-intensive pipelines. We propose TechniqueRAG, a domain-specific retrieval-augmented generation (RAG) framework that bridges this gap by integrating off-the-shelf retrievers, instruction-tuned LLMs, and minimal text-technique pairs.
arXiv Detail & Related papers (2025-05-17T12:46:10Z)
- Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation [66.66243874361103]
Dataset generation faces two key challenges: 1) aligning generated samples with the target domain and 2) producing informative samples beyond the training data.
We propose Concept-Aware LoRA, a novel fine-tuning approach that selectively identifies and updates only the weights associated with necessary concepts for domain alignment.
We demonstrate its effectiveness in generating datasets for urban-scene segmentation, outperforming baseline and state-of-the-art methods in in-domain settings.
arXiv Detail & Related papers (2025-03-28T06:23:29Z)
- Data-Efficient CLIP-Powered Dual-Branch Networks for Source-Free Unsupervised Domain Adaptation [4.7589762171821715]
Source-free Unsupervised Domain Adaptation (SF-UDA) aims to transfer a model's performance from a labeled source domain to an unlabeled target domain without direct access to source samples.
We introduce a data-efficient, CLIP-powered dual-branch network (CDBN) to address the dual challenges of limited source data and privacy concerns.
CDBN achieves near state-of-the-art performance with far fewer source domain samples than existing methods across 31 transfer tasks on seven datasets.
arXiv Detail & Related papers (2024-10-21T09:25:49Z)
- Gradual Source Domain Expansion for Unsupervised Domain Adaptation [45.207132297204424]
Unsupervised domain adaptation (UDA) tries to overcome the need for a large labeled dataset by transferring knowledge from a source dataset to a target dataset.
We propose a gradual source domain expansion (GSDE) algorithm to overcome this problem.
GSDE trains the UDA task several times from scratch, reinitializing the network weights on each run while expanding the source dataset with target data.
arXiv Detail & Related papers (2023-11-16T06:18:35Z)
- Multi-scale Feature Alignment for Continual Learning of Unlabeled Domains [3.9498537297431167]
Generative feature-driven image replay in conjunction with a dual-purpose discriminator enables the generation of images with realistic features for replay.
We present detailed ablation experiments studying our proposed method components and demonstrate a possible use-case of our continual UDA method for an unsupervised patch-based segmentation task.
arXiv Detail & Related papers (2023-02-02T18:19:01Z)
- Cyclically Disentangled Feature Translation for Face Anti-spoofing [61.70377630461084]
We propose a novel domain adaptation method called cyclically disentangled feature translation network (CDFTN).
CDFTN generates pseudo-labeled samples that possess: 1) source domain-invariant liveness features and 2) target domain-specific content features, which are disentangled through domain adversarial training.
A robust classifier is trained based on the synthetic pseudo-labeled images under the supervision of source domain labels.
arXiv Detail & Related papers (2022-12-07T14:12:34Z)
- Deep Unsupervised Domain Adaptation: A Review of Recent Advances and Perspectives [16.68091981866261]
Unsupervised domain adaptation (UDA) is proposed to counter the performance drop on data in a target domain.
UDA has yielded promising results on natural image processing, video analysis, natural language processing, time-series data analysis, medical image analysis, etc.
arXiv Detail & Related papers (2022-08-15T20:05:07Z)
- Domain-Agnostic Prior for Transfer Semantic Segmentation [197.9378107222422]
Unsupervised domain adaptation (UDA) is an important topic in the computer vision community.
We present a mechanism that regularizes cross-domain representation learning with a domain-agnostic prior (DAP).
Our research reveals that UDA benefits much from better proxies, possibly from other data modalities.
arXiv Detail & Related papers (2022-04-06T09:13:25Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
- Unsupervised Domain Adaptive 3D Detection with Multi-Level Consistency [90.71745178767203]
Deep learning-based 3D object detection has achieved unprecedented success with the advent of large-scale autonomous driving datasets.
Existing 3D domain adaptive detection methods often assume prior access to the target domain annotations, which is rarely feasible in the real world.
We study a more realistic setting, unsupervised 3D domain adaptive detection, which only utilizes source domain annotations.
arXiv Detail & Related papers (2021-07-23T17:19:23Z)
- A Curriculum-style Self-training Approach for Source-Free Semantic Segmentation [91.13472029666312]
We propose a curriculum-style self-training approach for source-free domain adaptive semantic segmentation.
Our method yields state-of-the-art performance on source-free semantic segmentation tasks for both synthetic-to-real and adverse conditions.
arXiv Detail & Related papers (2021-06-22T10:21:39Z)
- Disentanglement-based Cross-Domain Feature Augmentation for Effective Unsupervised Domain Adaptive Person Re-identification [87.72851934197936]
Unsupervised domain adaptive (UDA) person re-identification (ReID) aims to transfer the knowledge from the labeled source domain to the unlabeled target domain for person matching.
One challenge is how to generate target domain samples with reliable labels for training.
We propose a Disentanglement-based Cross-Domain Feature Augmentation strategy.
arXiv Detail & Related papers (2021-03-25T15:28:41Z)
- Generation for adaption: a Gan-based approach for 3D Domain Adaption in Point Cloud [10.614067060304919]
Unsupervised domain adaptation (UDA) seeks to overcome such a problem without target domain labels.
We propose a method that uses a generative adversarial network to generate synthetic data from the source domain.
Experiments show that our approach performs better than other state-of-the-art UDA methods in three popular 3D object/scene datasets.
arXiv Detail & Related papers (2021-02-15T07:24:10Z)
- Curriculum CycleGAN for Textual Sentiment Domain Adaptation with Multiple Sources [68.31273535702256]
We propose a novel instance-level MDA framework, named curriculum cycle-consistent generative adversarial network (C-CycleGAN).
C-CycleGAN consists of three components: (1) pre-trained text encoder which encodes textual input from different domains into a continuous representation space, (2) intermediate domain generator with curriculum instance-level adaptation which bridges the gap across source and target domains, and (3) task classifier trained on the intermediate domain for final sentiment classification.
We conduct extensive experiments on three benchmark datasets and achieve substantial gains over state-of-the-art DA approaches.
arXiv Detail & Related papers (2020-11-17T14:50:55Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Unsupervised Domain Adaptation for Person Re-Identification through Source-Guided Pseudo-Labeling [2.449909275410288]
Person Re-Identification (re-ID) aims at retrieving images of the same person taken by different cameras.
Unsupervised Domain Adaptation (UDA) is an interesting research direction for this challenge as it avoids a costly annotation of the target data.
We introduce a framework which relies on a two-branch architecture optimizing classification and triplet loss based metric learning in source and target domains.
arXiv Detail & Related papers (2020-09-20T14:54:42Z)
- Inductive Unsupervised Domain Adaptation for Few-Shot Classification via Clustering [16.39667909141402]
Few-shot classification tends to struggle when it needs to adapt to diverse domains.
We introduce a framework, DaFeC, to improve Domain adaptation performance for Few-shot classification via Clustering.
Our approach outperforms previous work with absolute gains (in classification accuracy) of 4.95%, 9.55%, 3.99% and 11.62%, respectively.
arXiv Detail & Related papers (2020-06-23T08:17:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.