DemaFormer: Damped Exponential Moving Average Transformer with
Energy-Based Modeling for Temporal Language Grounding
- URL: http://arxiv.org/abs/2312.02549v1
- Date: Tue, 5 Dec 2023 07:37:21 GMT
- Title: DemaFormer: Damped Exponential Moving Average Transformer with
Energy-Based Modeling for Temporal Language Grounding
- Authors: Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Cong-Duy Nguyen, See-Kiong
Ng, Luu Anh Tuan
- Abstract summary: Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query.
We propose an energy-based model framework to explicitly learn moment-query distributions.
We also propose DemaFormer, a novel Transformer-based architecture that utilizes an exponential moving average with a learnable damping factor.
- Score: 32.45280955448672
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal Language Grounding seeks to localize video moments that semantically
correspond to a natural language query. Recent advances employ the attention
mechanism to learn the relations between video moments and the text query.
However, naive attention might not be able to appropriately capture such
relations, resulting in ineffective distributions where target video moments
are difficult to separate from the remaining ones. To resolve the issue, we
propose an energy-based model framework to explicitly learn moment-query
distributions. Moreover, we propose DemaFormer, a novel Transformer-based
architecture that utilizes an exponential moving average with a learnable damping
factor to effectively encode moment-query inputs. Comprehensive experiments on
four public temporal language grounding datasets showcase the superiority of
our methods over the state-of-the-art baselines.
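The abstract names two ingredients, a damped exponential moving average encoder and an energy-based view of moment-query distributions, but describes them only at a high level. The sketch below illustrates the first ingredient as a minimal per-dimension damped EMA layer in PyTorch. The recurrence h_t = alpha * x_t + (1 - alpha * delta) * h_{t-1}, the module name DampedEMA, and the sigmoid parameterization of the smoothing factor alpha and damping factor delta are assumptions chosen for illustration; they are not claimed to match DemaFormer's exact formulation.

    import torch
    import torch.nn as nn

    class DampedEMA(nn.Module):
        """Per-dimension damped exponential moving average (illustrative sketch).

        Assumed recurrence: h_t = alpha * x_t + (1 - alpha * delta) * h_{t-1},
        with learnable smoothing (alpha) and damping (delta) factors squashed
        into (0, 1) by a sigmoid.
        """

        def __init__(self, dim: int):
            super().__init__()
            self.alpha_logit = nn.Parameter(torch.zeros(dim))  # smoothing factor, pre-sigmoid
            self.delta_logit = nn.Parameter(torch.zeros(dim))  # damping factor, pre-sigmoid

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, dim) joint moment-query features
            alpha = torch.sigmoid(self.alpha_logit)    # (dim,), in (0, 1)
            delta = torch.sigmoid(self.delta_logit)    # (dim,), in (0, 1)
            decay = 1.0 - alpha * delta                # fraction of history retained per step
            h = torch.zeros_like(x[:, 0])              # initial state h_0 = 0, shape (batch, dim)
            outputs = []
            for t in range(x.size(1)):                 # sequential scan over the time axis
                h = alpha * x[:, t] + decay * h
                outputs.append(h)
            return torch.stack(outputs, dim=1)         # (batch, seq_len, dim)

    # Shape check on random features: 2 videos, 64 moment-query tokens, 256-d each.
    layer = DampedEMA(dim=256)
    smoothed = layer(torch.randn(2, 64, 256))          # -> (2, 64, 256)

In an architecture of this kind, the smoothed features would presumably feed subsequent attention layers over the joint moment-query sequence; the final two lines only demonstrate the expected tensor shapes.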
Related papers
- Language-free Training for Zero-shot Video Grounding [50.701372436100684]
Video grounding aims to localize the time interval by understanding the text and video simultaneously.
One of the most challenging issues is the extremely time- and cost-consuming collection of annotations.
We present a simple yet novel training framework for video grounding in the zero-shot setting.
arXiv Detail & Related papers (2022-10-24T06:55:29Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- End-to-End Dense Video Grounding via Parallel Regression [30.984657885692553]
Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query.
We present an end-to-end parallel decoding paradigm (PRVG) built by re-purposing a Transformer-like architecture.
Thanks to its simple design, the PRVG framework can be applied in different testing schemes.
arXiv Detail & Related papers (2021-09-23T10:03:32Z)
- Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.