BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos
- URL: http://arxiv.org/abs/2312.00083v2
- Date: Thu, 18 Jul 2024 11:01:46 GMT
- Title: BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos
- Authors: Pilhyeon Lee, Hyeran Byun
- Abstract summary: Temporal sentence grounding aims to localize moments relevant to a language description.
We propose a novel boundary-oriented moment formulation.
Experiments on three benchmarks validate the effectiveness of the proposed methods.
- Score: 19.280799998526636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal sentence grounding aims to localize moments relevant to a language description. Recently, DETR-like approaches achieved notable progress by predicting the center and length of a target moment. However, they suffer from the issue of center misalignment raised by the inherent ambiguity of moment centers, leading to inaccurate predictions. To remedy this problem, we propose a novel boundary-oriented moment formulation. In our paradigm, the model no longer needs to find the precise center but instead suffices to predict any anchor point within the interval, from which the boundaries are directly estimated. Based on this idea, we design a boundary-aligned moment detection transformer, equipped with a dual-pathway decoding process. Specifically, it refines the anchor and boundaries within parallel pathways using global and boundary-focused attention, respectively. This separate design allows the model to focus on desirable regions, enabling precise refinement of moment predictions. Further, we propose a quality-based ranking method, ensuring that proposals with high localization qualities are prioritized over incomplete ones. Experiments on three benchmarks validate the effectiveness of the proposed methods. The code is available at https://github.com/Pilhyeon/BAM-DETR.
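To make the abstract's formulation concrete, the following is a minimal, hypothetical Python sketch rather than the authors' implementation from the linked repository; all function names and values are illustrative. It contrasts the conventional center-length parameterization with the boundary-oriented one, in which any anchor point inside the moment plus two boundary offsets decodes a span, and it illustrates the idea of ranking proposals by a predicted localization quality.
```python
import torch

def center_length_to_span(center, length):
    # Conventional DETR-style moment parameterization: predict (center, length).
    # An ambiguous center shifts BOTH decoded boundaries of the span.
    return center - length / 2, center + length / 2

def anchor_to_span(anchor, dist_to_start, dist_to_end):
    # Boundary-oriented parameterization described in the abstract:
    # any anchor point inside the moment suffices; the start and end
    # boundaries are regressed directly as distances from that anchor.
    return anchor - dist_to_start, anchor + dist_to_end

def rank_by_quality(spans, quality):
    # Quality-based ranking idea: order proposals by a predicted localization
    # quality (e.g., an IoU-like score) so well-aligned spans come first.
    order = torch.argsort(quality, descending=True)
    return spans[order], quality[order]

# Toy example with normalized timestamps in [0, 1].
anchor = torch.tensor([0.42, 0.80])       # anchor points inside two candidate moments
d_start = torch.tensor([0.12, 0.05])      # predicted distances to the start boundaries
d_end = torch.tensor([0.23, 0.10])        # predicted distances to the end boundaries
start, end = anchor_to_span(anchor, d_start, d_end)
spans = torch.stack([start, end], dim=1)  # [[0.30, 0.65], [0.75, 0.90]]
quality = torch.tensor([0.9, 0.4])        # hypothetical predicted localization quality
ranked_spans, ranked_quality = rank_by_quality(spans, quality)
print(ranked_spans)
```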
Related papers
- Dynamic Position Transformation and Boundary Refinement Network for Left Atrial Segmentation [17.09918110723713]
Left atrial (LA) segmentation is a crucial technique for irregular heartbeat (i.e., atrial fibrillation) diagnosis.
Most current methods for LA segmentation strictly assume that the input data is acquired using object-oriented center cropping.
We propose a novel Dynamic Position transformation and Boundary refinement Network (DPBNet) to tackle these issues.
arXiv Detail & Related papers (2024-07-07T22:09:35Z)
- FRAME: A Modular Framework for Autonomous Map Merging: Advancements in the Field [12.247977717070773]
This article presents a novel approach for merging 3D point cloud maps in the context of egocentric multi-robot exploration.
The proposed approach leverages state-of-the-art place recognition and learned descriptors to efficiently detect overlap between maps.
The effectiveness of the proposed framework is successfully demonstrated through multiple field missions of robot exploration.
arXiv Detail & Related papers (2024-04-27T20:54:15Z)
- Centre Stage: Centricity-based Audio-Visual Temporal Action Detection [26.42447737005981]
We explore strategies to incorporate the audio modality, using multi-scale cross-attention to fuse the two modalities.
We propose a novel network head to estimate the closeness of timesteps to the action centre, which we call the centricity score.
arXiv Detail & Related papers (2023-11-28T03:02:00Z)
- Temporal Action Localization with Enhanced Instant Discriminability [66.76095239972094]
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video.
We propose a one-stage framework named TriDet to resolve imprecise predictions of action boundaries by existing methods.
Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets.
arXiv Detail & Related papers (2023-09-11T16:17:50Z)
- Implicit and Efficient Point Cloud Completion for 3D Single Object Tracking [9.372859423951349]
We introduce two novel modules, i.e., Adaptive Refine Prediction (ARP) and Target Knowledge Transfer (TKT).
Our model achieves state-of-the-art performance while maintaining a lower computational consumption.
arXiv Detail & Related papers (2022-09-01T15:11:06Z)
- Semi-Supervised Temporal Action Detection with Proposal-Free Masking [134.26292288193298]
We propose a novel Semi-supervised Temporal action detection model based on PropOsal-free Temporal mask (SPOT).
SPOT outperforms state-of-the-art alternatives, often by a large margin.
arXiv Detail & Related papers (2022-07-14T16:58:47Z)
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
- Point-Level Temporal Action Localization: Bridging Fully-supervised Proposals to Weakly-supervised Losses [84.2964408497058]
Point-level temporal action localization (PTAL) aims to localize actions in untrimmed videos with only one timestamp annotation for each action instance.
Existing methods adopt the frame-level prediction paradigm to learn from the sparse single-frame labels.
This paper attempts to explore the proposal-based prediction paradigm for point-level annotations.
arXiv Detail & Related papers (2020-12-15T12:11:48Z)
- Making Affine Correspondences Work in Camera Geometry Computation [62.7633180470428]
Local features provide region-to-region rather than point-to-point correspondences.
We propose guidelines for effective use of region-to-region matches in the course of a full model estimation pipeline.
Experiments show that affine solvers can achieve accuracy comparable to point-based solvers at faster run-times.
arXiv Detail & Related papers (2020-07-20T12:07:48Z)
- Robust 6D Object Pose Estimation by Learning RGB-D Features [59.580366107770764]
We propose a novel discrete-continuous formulation for rotation regression to resolve this local-optimum problem.
We uniformly sample rotation anchors in SO(3), and predict a constrained deviation from each anchor to the target, as well as uncertainty scores for selecting the best prediction.
Experiments on two benchmarks: LINEMOD and YCB-Video, show that the proposed method outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2020-02-29T06:24:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.