Weak Supervision and Referring Attention for Temporal-Textual
Association Learning
- URL: http://arxiv.org/abs/2006.11747v2
- Date: Sat, 27 Jun 2020 08:19:35 GMT
- Title: Weak Supervision and Referring Attention for Temporal-Textual
Association Learning
- Authors: Zhiyuan Fang, Shu Kong, Zhe Wang, Charless Fowlkes, Yezhou Yang
- Abstract summary: We propose a weakly-supervised alternative for learning temporal-textual association (dubbed WSRA).
The weak supervision is simply a textual expression at the video level, indicating that the video contains relevant frames.
The referring attention is a mechanism we design that acts as a scoring function for temporally grounding the given queries over frames.
We validate WSRA through extensive experiments on temporal grounding by language, demonstrating that it notably outperforms state-of-the-art weakly-supervised methods.
- Score: 35.469984595398905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A system capturing the association between video frames and textual queries
offers great potential for better video analysis. However, training such a
system in a fully supervised way inevitably demands a meticulously curated
video dataset with temporal-textual annotations. We therefore provide a
weakly-supervised alternative with our proposed Referring Attention mechanism to
learn temporal-textual association (dubbed WSRA). The weak supervision is
simply a textual expression (e.g., short phrases or sentences) at the video level,
indicating that the video contains relevant frames. The referring attention is a
mechanism we design that acts as a scoring function for temporally grounding the
given queries over frames. It consists of multiple novel losses and sampling
strategies for better training. The principle of our design is to
fully exploit 1) the weak supervision, by considering informative and
discriminative cues from intra-video segments anchored to the textual query,
2) multiple queries compared against a single video, and 3) cross-video visual
similarities. We validate WSRA through extensive experiments on temporal
grounding by language, demonstrating that it notably outperforms state-of-the-art
weakly-supervised methods.
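To make the core mechanism concrete, below is a minimal PyTorch sketch of query-conditioned temporal attention trained with only video-level weak supervision. It illustrates the general idea rather than the authors' implementation: the module name `ReferringAttentionSketch`, the dimensions, the softmax temperature, and the ranking margin are assumptions, and the paper's additional losses (e.g., the cross-video similarity term and its sampling strategies) are omitted.

```python
# Illustrative sketch only: per-frame scores ground a textual query over time,
# while the only supervision is a video-level match/mismatch signal.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferringAttentionSketch(nn.Module):
    """Scores each frame against a query embedding and pools to a video-level score."""

    def __init__(self, frame_dim: int, text_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)  # project frame features
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # project query embedding

    def forward(self, frames: torch.Tensor, query: torch.Tensor):
        # frames: (T, frame_dim) features of T frames; query: (text_dim,) sentence embedding
        f = F.normalize(self.frame_proj(frames), dim=-1)    # (T, H)
        q = F.normalize(self.text_proj(query), dim=-1)      # (H,)
        frame_scores = f @ q                                # (T,) per-frame relevance scores
        attn = torch.softmax(frame_scores / 0.1, dim=0)     # temporal attention weights
        video_score = (attn * frame_scores).sum()           # attention-pooled video-level score
        return video_score, frame_scores                    # frame_scores serve as the grounding


def weak_ranking_loss(model, frames, pos_query, neg_query, margin: float = 0.2):
    """Video-level weak supervision: the matched query should outscore a mismatched one."""
    pos_score, _ = model(frames, pos_query)
    neg_score, _ = model(frames, neg_query)
    return F.relu(margin - pos_score + neg_score)


if __name__ == "__main__":
    model = ReferringAttentionSketch(frame_dim=512, text_dim=300)
    frames = torch.randn(64, 512)   # 64 frames of one video
    pos_query = torch.randn(300)    # query describing this video
    neg_query = torch.randn(300)    # query sampled from another video
    loss = weak_ranking_loss(model, frames, pos_query, neg_query)
    loss.backward()                 # gradients flow with only video-level labels
    print(float(loss))
```

At inference time, the per-frame scores (rather than the pooled video score) are what a method of this kind would threshold or rank to localize the query temporally.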
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z) - Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing video paragraph grounding (VPG) approaches rely heavily on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z) - Temporal Sentence Grounding in Streaming Videos [60.67022943824329]
This paper aims to tackle a novel task: Temporal Sentence Grounding in Streaming Videos (TSGSV).
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
arXiv Detail & Related papers (2023-08-14T12:30:58Z) - Fine-grained Iterative Attention Network for TemporalLanguage
Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z) - Straight to the Point: Fast-forwarding Videos via Reinforcement Learning
Using Textual Data [1.004766879203303]
We present a novel methodology based on a reinforcement learning formulation to accelerate instructional videos.
Our approach can adaptively skip frames that are not relevant to conveying the information, without creating gaps in the final video.
We propose a novel network, called Visually-guided Document Attention Network (VDAN), able to generate a highly discriminative embedding space.
arXiv Detail & Related papers (2020-03-31T14:07:45Z) - Weakly-Supervised Multi-Level Attentional Reconstruction Network for
Grounding Textual Queries in Videos [73.4504252917816]
The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query.
Most of the existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios.
We present an effective weakly-supervised model, named Multi-Level Attentional Reconstruction Network (MARN), which relies only on video-sentence pairs during the training stage (a generic sketch of this reconstruction-style weak supervision follows the list).
arXiv Detail & Related papers (2020-03-16T07:01:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.