From Semantic Roles to Opinion Roles: SRL Data Extraction for Multi-Task and Transfer Learning in Low-Resource ORL
- URL: http://arxiv.org/abs/2511.08537v1
- Date: Wed, 12 Nov 2025 02:03:41 GMT
- Title: From Semantic Roles to Opinion Roles: SRL Data Extraction for Multi-Task and Transfer Learning in Low-Resource ORL
- Authors: Amirmohammad Omidi Galdiani, Sepehr Rezaei Melal, Mohammad Norasteh, Arash Yousefi Jordehi, Seyed Abolghasem Mirroshandel
- Abstract summary: This report presents a methodology for constructing a high-quality Semantic Role Labeling (SRL) dataset from the Wall Street Journal (WSJ) portion of the OntoNotes 5.0 corpus. We implement a reproducible extraction pipeline that aligns predicate-argument structures with surface text, converts syntactic tree pointers to coherent spans, and applies rigorous cleaning to ensure semantic fidelity. The resulting dataset comprises 97,169 predicate-argument instances with clearly defined Agent (ARG0), Predicate (REL), and Patient (ARG1) roles, mapped to ORL's Holder, Expression, and Target schema.
- Score: 3.2641459166493405
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This report presents a detailed methodology for constructing a high-quality Semantic Role Labeling (SRL) dataset from the Wall Street Journal (WSJ) portion of the OntoNotes 5.0 corpus and adapting it for Opinion Role Labeling (ORL) tasks. Leveraging the PropBank annotation framework, we implement a reproducible extraction pipeline that aligns predicate-argument structures with surface text, converts syntactic tree pointers to coherent spans, and applies rigorous cleaning to ensure semantic fidelity. The resulting dataset comprises 97,169 predicate-argument instances with clearly defined Agent (ARG0), Predicate (REL), and Patient (ARG1) roles, mapped to ORL's Holder, Expression, and Target schema. We provide a detailed account of our extraction algorithms, discontinuous argument handling, annotation corrections, and statistical analysis of the resulting dataset. This work offers a reusable resource for researchers aiming to leverage SRL for enhancing ORL, especially in low-resource opinion mining scenarios.
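The role correspondence stated in the abstract (ARG0 to Holder, REL to Expression, ARG1 to Target) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the dictionary layout, the function name `srl_to_orl`, and the example sentence are all assumptions made for illustration, and the real extraction also involves tree-pointer conversion and cleaning that are not shown here.

```python
# Hypothetical sketch: relabel a PropBank-style instance with ORL roles.
# Only the three-way correspondence itself comes from the abstract;
# the data layout and names below are illustrative assumptions.
SRL_TO_ORL = {"ARG0": "Holder", "REL": "Expression", "ARG1": "Target"}


def srl_to_orl(instance):
    """Return the instance with ORL role names, dropping unmapped roles."""
    return {
        SRL_TO_ORL[role]: text
        for role, text in instance.items()
        if role in SRL_TO_ORL
    }


# Invented example: "Analysts criticized the merger yesterday."
example = {
    "ARG0": "Analysts",
    "REL": "criticized",
    "ARG1": "the merger",
    "ARGM-TMP": "yesterday",  # modifier role, outside the three mapped roles
}
converted = srl_to_orl(example)
print(converted)
```

Roles outside the three mapped labels (here the temporal modifier) are simply discarded in this sketch; whether the original pipeline keeps or drops them is not stated in the abstract.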
Related papers
- ReFuGe: Feature Generation for Prediction Tasks on Relational Databases with LLM Agents [33.930224200799366]
ReFuGe is an agentic framework for generating predictive features on RDBs. It operates within an iterative feedback loop until performance converges. Experiments on RDB benchmarks demonstrate that ReFuGe substantially improves performance on various RDB prediction tasks.
arXiv Detail & Related papers (2026-01-25T08:02:29Z)
- Learning to Refine: An Agentic RL Approach for Iterative SPARQL Query Construction [0.18907108368038208]
Current methods lack the adaptive policies needed to dynamically debug queries based on real-time execution feedback. This paper introduces a novel agentic framework where an LLM learns a resilient policy for the sequential process of iterative SPARQL construction. We show that a compact 3B-parameter model, trained exclusively via outcome-driven Reinforcement Learning (GRPO), can learn effective policies for this task.
arXiv Detail & Related papers (2025-11-14T08:44:58Z)
- Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties [6.295923933999817]
Small language models (SLMs) have shown promise for relation extraction (RE) when extracting RDF triples guided by SHACL shapes. This paper investigates how SLMs handle both datatype and object properties for a complete RDF graph extraction.
arXiv Detail & Related papers (2025-11-05T12:16:51Z)
- Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [80.88654868264645]
The Arranged and Organized Extraction (AOE) Benchmark is designed to evaluate the ability of large language models to comprehend fragmented documents. AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries. Results show that even the most advanced models struggled significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z)
- IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
- Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision [50.45597801390757]
Instruct-LF is a goal-oriented latent factor discovery system. It integrates instruction-following ability with statistical models to handle noisy datasets.
arXiv Detail & Related papers (2025-02-21T02:03:08Z)
- Graph of Records: Boosting Retrieval Augmented Generation for Long-context Summarization with Graphs [12.878608250420832]
Retrieval-augmented generation (RAG) has revitalized Large Language Models (LLMs). We propose graph of records (GoR) to enhance RAG for long-context global summarization. GoR features a graph neural network and an elaborately designed BERTScore-based objective for self-supervised model training.
arXiv Detail & Related papers (2024-10-14T18:34:29Z)
- Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs [70.15262704746378]
We propose a systematically created human-annotated dataset consisting of coherent summaries for five publicly available datasets and natural language user feedback.
Preliminary experiments with Falcon-40B and Llama-2-13B show significant performance improvements (10% Rouge-L) in terms of producing coherent summaries.
arXiv Detail & Related papers (2024-07-05T20:25:04Z)
- How Good are LLMs at Relation Extraction under Low-Resource Scenario? Comprehensive Evaluation [7.151108031568037]
This paper constructs low-resource relation extraction datasets in 10 low-resource languages (LRLs) in three regions (Central Asia, Southeast Asia and Middle East).
The corpora are constructed by translating the original publicly available English RE datasets (NYT10, FewRel and CrossRE) using an effective multilingual machine translation.
Then, we use the language perplexity (PPL) to filter out the low-quality data from the translated datasets.
arXiv Detail & Related papers (2024-06-17T03:02:04Z)
- Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation [128.01050030936028]
We propose an information refinement training method named InFO-RAG.
InFO-RAG is low-cost and general across various tasks.
It improves the performance of LLaMA2 by an average of 9.39% relative points.
arXiv Detail & Related papers (2024-02-28T08:24:38Z)
- Semantic Role Labeling Meets Definition Modeling: Using Natural Language to Describe Predicate-Argument Structures [104.32063681736349]
We present an approach to describe predicate-argument structures using natural language definitions instead of discrete labels.
Our experiments and analyses on PropBank-style and FrameNet-style, dependency-based and span-based SRL also demonstrate that a flexible model with an interpretable output does not necessarily come at the expense of performance.
arXiv Detail & Related papers (2022-12-02T11:19:16Z)
- Semantic Role Labeling as Syntactic Dependency Parsing [19.919191146167584]
Three common syntactic patterns account for over 98% of the PropBank-style semantic role labeling annotations.
We present a conversion scheme that packs SRL annotations into dependency tree representations through joint labels.
arXiv Detail & Related papers (2020-10-21T17:46:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.