LLM-powered Query Expansion for Enhancing Boundary Prediction in Language-driven Action Localization
- URL: http://arxiv.org/abs/2505.24282v1
- Date: Fri, 30 May 2025 06:59:35 GMT
- Title: LLM-powered Query Expansion for Enhancing Boundary Prediction in Language-driven Action Localization
- Authors: Zirui Shang, Xinxiao Wu, Shuo Yang
- Abstract summary: Language-driven action localization in videos requires semantic alignment between language query and video segment. We propose to expand the original query by generating textual descriptions of the action start and end boundaries. We also propose to model probability scores of action boundaries by calculating the semantic similarities between frames and the expanded query. Our method is model-agnostic and can be seamlessly integrated into any existing models of language-driven action localization.
- Score: 25.103269229541564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language-driven action localization in videos requires not only semantic alignment between the language query and the video segment, but also prediction of action boundaries. However, the language query primarily describes the main content of an action and usually lacks specific details of the action's start and end boundaries, which increases the subjectivity of manual boundary annotation and leads to boundary uncertainty in the training data. In this paper, on the one hand, we propose to expand the original query by generating textual descriptions of the action start and end boundaries through LLMs, which provide more detailed boundary cues for localization and thus reduce the impact of boundary uncertainty. On the other hand, to enhance tolerance to boundary uncertainty during training, we propose to model probability scores of action boundaries by combining the semantic similarities between frames and the expanded query with the temporal distances between frames and the annotated boundary frames. These scores provide more consistent boundary supervision, thus improving the stability of training. Our method is model-agnostic and can be seamlessly integrated into any existing model of language-driven action localization in an off-the-shelf manner. Experimental results on several datasets demonstrate the effectiveness of our method.
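The boundary-scoring idea in the abstract can be sketched in a few lines: per-frame probability scores are derived from the semantic similarity between frame features and the LLM-expanded boundary description, modulated by the temporal distance to the annotated boundary frame. This is a minimal illustrative sketch, not the authors' implementation; the function name, the Gaussian temporal prior, and the `sigma` hyperparameter are all assumptions.

```python
# Hedged sketch of the boundary probability scores described in the abstract.
# The Gaussian temporal prior and all names here are illustrative assumptions,
# not the paper's actual formulation.
import numpy as np

def boundary_scores(frame_feats, boundary_query_feat, annotated_idx, sigma=2.0):
    """Combine semantic similarity and temporal distance into
    per-frame boundary probability scores.

    frame_feats: (T, D) array of frame embeddings (assumed L2-normalised)
    boundary_query_feat: (D,) embedding of the LLM-expanded boundary description
    annotated_idx: index of the manually annotated boundary frame
    sigma: width of the temporal prior (an assumed hyperparameter)
    """
    # Semantic similarity between each frame and the expanded boundary query
    sim = frame_feats @ boundary_query_feat            # (T,)
    # Temporal prior: frames near the annotated boundary get higher weight
    t = np.arange(len(frame_feats))
    dist = np.exp(-((t - annotated_idx) ** 2) / (2 * sigma ** 2))
    # Fuse the two cues and normalise into a distribution over frames
    scores = sim * dist
    scores = np.exp(scores - scores.max())             # numerically stable softmax
    return scores / scores.sum()
```

Because the temporal prior is smooth rather than a one-hot target on the annotated frame, such a score tolerates small annotation errors, which matches the abstract's motivation of softening boundary uncertainty during training.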
Related papers
- EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model [63.93372634950661]
We propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries.
Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multimodal large language models (MLLMs) to annotate each frame within initial pseudo boundaries.
arXiv Detail & Related papers (2023-12-05T04:15:56Z)
- Boundary-Aware Proposal Generation Method for Temporal Action Localization [23.79359799496947]
TAL aims to find the categories and temporal boundaries of actions in an untrimmed video.
Most TAL methods rely heavily on action recognition models that are sensitive to action labels rather than temporal boundaries.
We propose a Boundary-Aware Proposal Generation (BAPG) method with contrastive learning.
arXiv Detail & Related papers (2023-09-25T01:41:09Z)
- Temporal Action Localization with Enhanced Instant Discriminability [66.76095239972094]
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video.
We propose a one-stage framework named TriDet to resolve imprecise predictions of action boundaries by existing methods.
Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets.
arXiv Detail & Related papers (2023-09-11T16:17:50Z)
- Video Activity Localisation with Uncertainties in Temporal Boundary [74.7263952414899]
Methods for video activity localisation over time assume implicitly that activity temporal boundaries are determined and precise.
In unscripted natural videos, different activities transition smoothly, so it is intrinsically ambiguous to label precisely when an activity starts and ends.
We introduce Elastic Moment Bounding (EMB) to accommodate flexible and adaptive activity temporal boundaries.
arXiv Detail & Related papers (2022-06-26T16:45:56Z)
- Boundary Guided Context Aggregation for Semantic Segmentation [23.709865471981313]
We exploit boundary as a significant guidance for context aggregation to promote the overall semantic understanding of an image.
We conduct extensive experiments on the Cityscapes and ADE20K databases, and comparable results are achieved with the state-of-the-art methods.
arXiv Detail & Related papers (2021-10-27T17:04:38Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Active Boundary Loss for Semantic Segmentation [58.72057610093194]
This paper proposes a novel active boundary loss for semantic segmentation.
It can progressively encourage the alignment between predicted boundaries and ground-truth boundaries during end-to-end training.
Experimental results show that training with the active boundary loss can effectively improve the boundary F-score and mean Intersection-over-Union.
arXiv Detail & Related papers (2021-02-04T15:47:54Z)
- Boundary-sensitive Pre-training for Temporal Localization in Videos [124.40788524169668]
We investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task.
With the synthesized boundaries, BSP can be conducted simply by classifying the boundary types.
Extensive experiments show that the proposed BSP is superior and complementary to the existing action classification based pre-training counterpart.
arXiv Detail & Related papers (2020-11-21T17:46:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.