Predicting Implicit Arguments in Procedural Video Instructions
- URL: http://arxiv.org/abs/2505.21068v1
- Date: Tue, 27 May 2025 11:53:06 GMT
- Title: Predicting Implicit Arguments in Procedural Video Instructions
- Authors: Anil Batra, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller
- Abstract summary: Implicit-VidSRL is a dataset that necessitates inferring implicit and explicit arguments from contextual information in cooking procedures. We study recent multimodal LLMs and reveal that they struggle to predict implicit arguments of what and where/with from multimodal procedural data given the verb. We propose iSRL-Qwen2-VL, which achieves a 17% relative improvement in F1-score for what-implicit and a 14.7% relative improvement for where/with-implicit semantic roles over GPT-4o.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Procedural texts help AI enhance reasoning about context and action sequences. Transforming these into Semantic Role Labeling (SRL) annotations improves understanding of individual steps by identifying predicate-argument structures like {verb, what, where/with}. Procedural instructions are highly elliptical; for instance, in (i) add cucumber to the bowl and (ii) add sliced tomatoes, the second step's where argument is inferred from context and refers to where the cucumber was placed. Prior SRL benchmarks often miss implicit arguments, leading to incomplete understanding. To address this, we introduce Implicit-VidSRL, a dataset that necessitates inferring implicit and explicit arguments from contextual information in multimodal cooking procedures. Our proposed dataset benchmarks multimodal models' contextual reasoning, requiring entity tracking through visual changes in recipes. We study recent multimodal LLMs and reveal that they struggle to predict implicit arguments of what and where/with from multimodal procedural data given the verb. Lastly, we propose iSRL-Qwen2-VL, which achieves a 17% relative improvement in F1-score for what-implicit and a 14.7% relative improvement for where/with-implicit semantic roles over GPT-4o.
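To make the ellipsis concrete, the sketch below shows one way such {verb, what, where/with} frames could be represented, together with a naive text-only heuristic that fills an implicit where/with by carrying the most recent explicit location forward. The field names and the heuristic are illustrative assumptions, not the Implicit-VidSRL schema or the proposed iSRL-Qwen2-VL model.

```python
# Minimal sketch: SRL frames for procedural steps with implicit arguments.
# Field names and the carry-forward heuristic are hypothetical, for illustration only.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SRLFrame:
    verb: str
    what: Optional[str]        # None means the argument is implicit in the text
    where_with: Optional[str]  # None means the argument is implicit in the text

def resolve_implicit_where(steps: List[SRLFrame]) -> List[SRLFrame]:
    """Fill a missing where/with by reusing the most recent explicit where/with."""
    last_location: Optional[str] = None
    resolved = []
    for step in steps:
        where_with = step.where_with if step.where_with is not None else last_location
        if where_with is not None:
            last_location = where_with
        resolved.append(SRLFrame(step.verb, step.what, where_with))
    return resolved

recipe = [
    SRLFrame(verb="add", what="cucumber", where_with="bowl"),
    SRLFrame(verb="add", what="sliced tomatoes", where_with=None),  # implicit where
]
print(resolve_implicit_where(recipe)[1].where_with)  # -> "bowl"
```

A real model would ground this decision in the video frames and the full recipe context rather than in the last mentioned location; the sketch only shows where the implicit argument sits in the frame.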
Related papers
- Context-Aware Hierarchical Merging for Long Document Summarization [56.96619074316232]
We propose different approaches to enrich hierarchical merging with context from the source document. Experimental results on datasets representing legal and narrative domains show that contextual augmentation consistently outperforms zero-shot and hierarchical merging baselines.
arXiv Detail & Related papers (2025-02-03T01:14:31Z) - BP4ER: Bootstrap Prompting for Explicit Reasoning in Medical Dialogue Generation [31.40174974440382]
Medical dialogue generation (MDG) has gained increasing attention due to its substantial practical value.
We propose Bootstrap Prompting for Explicit Reasoning in MDG (BP4ER).
BP4ER explicitly models MDG's multi-step reasoning process and iteratively enhances it.
Experimental findings on two public datasets indicate that BP4ER outperforms state-of-the-art methods on both objective and subjective evaluation metrics.
arXiv Detail & Related papers (2024-03-28T13:38:13Z) - Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z) - Infusing Hierarchical Guidance into Prompt Tuning: A Parameter-Efficient Framework for Multi-level Implicit Discourse Relation Recognition [16.647413058592125]
Multi-level implicit discourse relation recognition (MIDRR) aims at identifying hierarchical discourse relations among arguments.
In this paper, we propose a prompt-based Parameter-Efficient Multi-level IDRR (PEMI) framework to solve the above problems.
arXiv Detail & Related papers (2024-02-23T03:53:39Z) - ULTRA: Unleash LLMs' Potential for Event Argument Extraction through Hierarchical Modeling and Pair-wise Self-Refinement [6.035020544588768]
Event argument extraction (EAE) is the task of identifying role-specific text spans (i.e., arguments) for a given event. We propose a hierarchical framework that extracts event arguments more cost-effectively. We introduce LEAFER to address the challenge LLMs face in locating the exact boundary of an argument.
arXiv Detail & Related papers (2024-01-24T04:13:28Z) - Instruction Position Matters in Sequence Generation with Large Language Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences.
arXiv Detail & Related papers (2023-08-23T12:36:57Z) - Guiding Large Language Models via Directional Stimulus Prompting [114.84930073977672]
We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs.
Instead of directly adjusting LLMs, our method employs a small tunable policy model to generate an auxiliary directional stimulus prompt for each input instance.
arXiv Detail & Related papers (2023-02-22T17:44:15Z) - IAM: A Comprehensive and Large-Scale Dataset for Integrated Argument Mining Tasks [59.457948080207174]
In this work, we introduce a comprehensive and large dataset named IAM, which can be applied to a series of argument mining tasks.
Nearly 70k sentences in the dataset are fully annotated based on their argument properties.
We propose two new integrated argument mining tasks associated with the debate preparation process: (1) claim extraction with stance classification (CESC) and (2) claim-evidence pair extraction (CEPE).
arXiv Detail & Related papers (2022-03-23T08:07:32Z) - Learning to Ask Conversational Questions by Optimizing Levenshtein Distance [83.53855889592734]
We introduce a Reinforcement Iterative Sequence Editing (RISE) framework that optimizes the minimum Levenshtein distance (MLD) through explicit editing actions.
RISE is able to pay attention to tokens that are related to conversational characteristics.
Experimental results on two benchmark datasets show that RISE significantly outperforms state-of-the-art methods.
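For reference, the quantity RISE optimizes is the standard minimum Levenshtein distance between the generated and target sequence. A minimal character-level implementation of that distance (of the metric only, not of the RISE editing framework) looks like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string `a` into string `b`."""
    # Dynamic-programming row over prefixes of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("add the tomato", "add the sliced tomato"))  # -> 7
```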
arXiv Detail & Related papers (2021-06-30T08:44:19Z) - Great Service! Fine-grained Parsing of Implicit Arguments [7.785534704637891]
We show that certain types of implicit arguments are more difficult to parse than others.
This work facilitates a better understanding of implicit and underspecified language by incorporating it holistically into meaning representations.
arXiv Detail & Related papers (2021-06-04T15:50:35Z)